Abstract — Mobile multi-robot teams deployed for monitoring or search-and-rescue missions in urban disaster areas can greatly improve
the quality of vital data collected on-site. Analysis of such data can identify hazards and save lives. Unfortunately, such real deployments 
at scale are cost prohibitive and robot failures lead to data loss. Moreover, scaled-down deployments do not capture significant levels 
of interaction and communication complexity. To tackle this problem, we propose novel mobility and failure generation frameworks that 
allow realistic simulations of mobile robot networks for large scale disaster scenarios. Furthermore, since data replication techniques 
can improve the survivability of data collected during the operation, we propose an adaptive, scalable data replication technique that 
achieves high data survivability with low overhead. Our technique considers the anticipated robot failures and robot heterogeneity to 
decide how aggressively to replicate data. In addition, it considers survivability priorities, with some data requiring more effort to be 
saved than others. Using our novel simulation generation frameworks, we compare our adaptive technique with flooding and broadcast- 
based replication techniques and show that for failure rates of up to 60% it ensures better data survivability with lower communication 
costs. 

Index Terms — adaptive scalable data survivability, mobile robot networks, robot failure models, robot heterogeneity, urban disaster
environments
1 Introduction

Recent technology advancements make robots valuable partners
in real-life mission-critical scenarios. Robots proved crucial
in disaster and rescue missions such as terrorist attacks on
civilians [31] [36], natural disasters such as hurricanes [33] [37]
and earthquakes ETI [22], and life-threatening mining accidents
[32]. However, despite these advances, there is still a
significant gap between laboratory work and real-life environments,
as acknowledged by various robotic competitions [39]
and on-going field research in this area [43].

In ad-hoc collaborating robot teams this gap is highly 
noticeable. Current robust deployments are typically done at 
a small scale and using teleoperation from human coordi- 
nators for reconnaissance and mapping of a disaster area, 
assessment of the damages and identification of dangerous 
zones (e.g., [3] [26] [33]). Robots in such teams typically do
not communicate with each other, but collect and transmit 
information either wirelessly or through tethering to a base 
station. 

Deployments where robots are more autonomous and col- 
laborate with each other to achieve a global task in critical 
missions [29] have not yet proven to scale to hundreds
or thousands of robots. Thus, the full potential of large 
teams of collaborating robots collecting crucial information 
in time-pressure, life-threatening, real-life scenarios is yet 
to be realized. For example, such a scenario could involve 
thousands of micro-robots (e.g., inch-sized robots [12], palm-sized
helicopters or quadrocopters [5]) or even fly-sized
robots [48], to collect data on human survivors in a cataclysm.
At this scale, humans can be only partially involved in control- 
ling them; instead, the robots should reliably collaborate and 
coordinate with each other and the environment in an ad-hoc 
manner. 

A major obstacle that delays the appearance of such large- 
scale self-coordinating mobile robot networks is the reliability 
of the robots in the presence of environmental hazards. High 
failure rates lead to significant loss of data. Data replication 
can be used to improve the overall data survivability. However, 
existing techniques proposed in the literature, especially for
mobile ad-hoc networks (MANETs), do not work well in 
mobile robot networks targeting disaster scenarios for three 
reasons. First, these solutions assume homogeneous nodes, 
while robot teams may be heterogeneous, with different ca- 
pabilities and failure rates. Second, robot failures are not 
always independent, as these methods assume. Third,
the existing techniques have been designed for much denser 
networks that typically ensure network connectivity between 
any two nodes. Robot networks, on the other hand, are 
sparse, and reachability between nodes cannot be guaranteed, 
especially in urban-disaster scenarios. Thus, sending all the 
collected data to a base station in real-time is not possible. 

To tackle the problem of data survivability under realistic 
mobile robot network conditions, we propose an adaptive, 
scalable data replication technique that takes into account the 
neighborhood robot density and the rate and type of robot 
failures to decide how aggressively to replicate data for higher 
survivability. This technique is based on opportunistic communication,
similar to delay-tolerant networking, to maintain system flexibility,
reliability and robustness in the face of robot and network failures [13].
In addition, inspired by real-world scenarios, our technique includes a
new parameter that
represents the survivability requirements of data produced by 
robots. This parameter reflects the importance of different data 
types to the human crews and determines their replication 
needs during the mission. 

We also propose novel frameworks for generating realistic 
mobility and failure scenarios for teams of robots working in 
urban disaster environments, based on a thorough analysis of 
prototype robots in real-life monitoring situations [7, 8, 9]. The
mobility generation framework allows variable type, number, 
and size of deployment areas, variable type and number of 
robots deployed in each area, and a different type of mobility
for each area. Furthermore, the failure generation framework 
realizes three types of failures: independent robot failures, 
specific to technical malfunctions; independent area failures, 
specific to problems such as collapsed bridges or buildings; 
and clustered area failures, specific to major disasters such
as explosions or broken levees. Via simulations using these 
frameworks, and under extreme failure rates, we demonstrate 
that our adaptive replication technique results in higher overall 
data survivability than baseline techniques for comparable 
communication costs. 

The rest of this paper is organized as follows: Section 2
describes how large-scale mobile robot teams could work in
real-life situations. Section 3 presents the design of our mobility
generation framework when considering real-life scenarios
of robotic missions in large operational areas where robots
are assigned roles and areas to operate in. Section 4 defines
analytically the failure generation framework considered in
this study and proposes three failure models applicable to
various real mission scenarios. Section 5 discusses replication
techniques for data survivability. We evaluate and compare
the replication techniques using our mobility and failure
generation frameworks via extensive simulations in Section 6.
Section 7 discusses related work, and Section 8 elaborates on
our experimental findings and concludes this study.

2 Mobile Robot Team in an Operational Scenario

Today, various types of robots are used in search and res- 
cue missions and other human-robot tasks. These robots are 
unmanned vehicles with an operating autonomy of a few
hours, depending on the usage of the on-board components. A
typical robot, such as the ones operated by CRASAR [11] and
shown in Figure 1, has a powerful on-board computing system, usually
comparable to today's PCs. Along with the computation platform,
other hardware modules installed may include GPS, inertial 
measurement unit, laser, cameras, temperature sensors, and 
network cards. 

A potential operating scenario for a team of robots is 
presented in Figure 2. After a strong hurricane, much of the
civilian infrastructure (bridges, roads, and buildings) has been 
affected. The deployment of human personnel for assessing the 
damage is considered extremely dangerous, and the authorities
decide to deploy a team of robots of various types, such as
unmanned ground vehicles (UGVs), unmanned aerial vehicles 
(UAVs), and unmanned surface vehicles (USVs). The three 
bridges need to be inspected to assess the magnitude of the 
damage. This scenario emphasizes the distributed interaction 
between robots and identifies the parameters that need to be 
modeled for these networks. 

Three "fleets" of USVs enter the bay and each fleet is 
assigned the task to examine one bridge. A fleet consists of a 
mother ship which deploys teams of robots composed of a
USV and two rotary-wing low-altitude micro-aerial vehicles.
These teams are assigned to portions of each bridge. The aerial 
vehicles provide support to the surface vehicles. Each mother 
ship circles the sub-bay area of the bridge to provide support 
and delay-tolerant network connectivity. In addition, a
low-altitude fixed-wing vehicle and a ground vehicle
provide surveillance for each bridge. Eight swarms of mid-altitude
fixed-wing aerial vehicles cover the coast. Each swarm
has five UAVs flying in a coverage pattern over a portion of 
the coastal area. Two higher altitude aerial vehicles circle the 
remaining ground area to provide a total aerial coverage. The 
bridges are also inspected by three convoys of ten UGVs each. 

The teams produce data using their on-board sensors and 
use ad-hoc networking to coordinate with each other. When 
the mission ends, they travel to a predefined location, where 
data are collected and analyzed in greater detail to assess the 
damage to each bridge. Therefore, it is of critical importance 
that the data produced by each robot are stored in the system 
in a resilient manner to survive the mission. Currently, the 
survivability of data is directly dependent on the fault tolerance 
of the robots and their intelligence to stay off dangerous paths 
or areas until the mission is over. Unfortunately, in this type of
disaster mission, robots fail frequently due to environmental
conditions (e.g., explosions or collapsed buildings) as well
as hardware problems. Thus, ad-hoc networking and collaboration
between robots must be used to increase data survivability.

The operational scenario described above helps us identify 
the following parameters to be modeled in our mobility and 
failure generation frameworks: (1) the scale of the operation 
area that all robots are situated in, (2) the types and sizes of 
sub-areas that particular robots are assigned to work in, (3) the 
types and number of robots available during the operation,
which take on different mission roles and are assigned to particular
types of sub-areas, (4) the types of mobility for particular types
of robots, and (5) the types and rates of failures anticipated for
particular types of robots during the mission. We explore these
operational parameters in the next sections, where we describe 
in more detail the mobility and failure generation frameworks 
considered in this study. 

Fig. 1. Examples of unmanned (ground, aerial and surface) robotic vehicles used in human-robot missions.

Fig. 2. Operational scenario of a multi-robot littoral deployment. Continuous circles represent small areas of the bridge
scanned by UAVs and USVs. The dotted elliptical line gives the flight path of a low altitude UAV monitoring the small
teams assigned to each bridge. Light shaded areas represent swarms of UAVs, whereas dark long dashed lines show
the coverage offered by higher altitude UAVs.

3 Modeling Mobility in Robot Networks

Because large-scale real deployments are cost prohibitive and
scaled-down deployments do not capture a significant level
of interaction and communication complexity for our study,
simulations are necessary for understanding the behavior of
mobile robot networks at large scales. To realistically simulate
a mobile robot network, we created a framework that models
robot assignment to particular areas, and robot mobility
according to their mission role. This framework takes into
consideration the following characteristics of mobile robot
networks:

• Mobile robot networks are sparse, with node degrees of 
2 or 3. 

• The robots are not uniformly distributed in the area of 
operation; instead, they are clustered in relatively small 
regions and assigned to work collaboratively on a task. 
Large parts of the operation area may, at times, have no 
robot presence. 

• The operation area in a disaster scenario could spread
over tens of square kilometers (e.g., the scenario presented
in Section 5 covers about 25 km²).

• Depending on their assigned task, robots could move systematically
to cover all their working area when mapping
or searching for a particular item, or have a random-like
mobility pattern when more greedy searching approaches
are used.

• Usually, end-to-end communication is not required be- 
tween all pairs of robots because of localized collabora- 
tion. Consequently, routing between robots of the same 
cluster is not a major problem since clusters that need 
routing have nodes with low mobility (e.g., 1-5 m/s).

• Robots in such large deployments can be highly hetero- 
geneous, with different capabilities, different failure rates, 
and different assignments and roles: in one scenario [41],
for example, search robots scan a disaster area for victims, 
while relay robots propagate data. More powerful nodes, 
such as UAVs and USVs can act as coordinators or data 
collectors for the others (e.g., UGVs). 

Based on these characteristics, we identified the following 
parameters for modeling mobile robot networks: (1) area
type, (2) area size, (3) robot types per area, (4) number of
robots per area, and (5) type of mobility per area. Figure 3
illustrates a network scenario with three types of areas and the 
corresponding types of robots assigned to each. The specifics 
of these types of areas and robots are discussed next. 
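
The five parameters listed above can be captured in a small scenario description. The following is only a minimal sketch of one possible encoding: the class and field names, the robot counts, the mobility labels, and all dimensions other than the 100 m x 100 m working area are hypothetical and are not part of the framework described in this paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class AreaType(Enum):
    WORKING = "working"
    CONNECTING = "connecting"
    MONITORING = "monitoring"


@dataclass
class AreaSpec:
    """One deployment area and the robots assigned to it (illustrative only)."""
    area_type: AreaType        # (1) area type
    width_m: float             # (2) area size
    height_m: float
    robots: Dict[str, int]     # (3) robot types and (4) number of robots per type
    mobility: str              # (5) mobility type; the labels used here are placeholders

    def robot_count(self) -> int:
        return sum(self.robots.values())


def total_robots(areas: List[AreaSpec]) -> int:
    """Number of robots deployed over all areas of a scenario."""
    return sum(area.robot_count() for area in areas)


# Hypothetical scenario: one working area, one connecting strip, one monitoring area.
scenario = [
    AreaSpec(AreaType.WORKING, 100.0, 100.0, {"scout": 4}, "random_waypoint"),
    AreaSpec(AreaType.CONNECTING, 5000.0, 100.0, {"archivist": 1}, "patrol"),
    AreaSpec(AreaType.MONITORING, 2500.0, 2500.0, {"supervisor": 1}, "patrol"),
]
```

Keeping the robot types as a mapping per area makes it easy to express heterogeneous teams inside a single area.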

Area Types in Operation Area: Typical tasks of robot 
networks include missions that cover large portions of terrain, 
water, or both. Thus, the overall operation area can be as large 
as tens of square kilometers. In such an operation area, we
define working, connecting and monitoring areas. We note that 
the mobility, type and number of robots appointed to each of 
these areas are intrinsically connected to the mission roles 
assigned to the robots and the subtasks they are to perform. 

Working Areas: These areas are typically covered by small teams of
collaborative robots that perform various subtasks under a
particular role such as scouting. These robots, called scouts, move slowly
(1-5 m/s), acquiring and analyzing data, while inspecting an
area (e.g., checking the structural integrity of bridge legs and 
damaged buildings), and searching for victims. They react 
and adapt to the current context and collaborate in groups 
to achieve a global task. Since scouts are assigned to areas 
expected to fail, valuable data collected by their on-board sen- 
sors must be replicated in order to increase their survivability. 
The working area size depends on the mission, but typically 
is about the size of a building (e.g., 100 m x 100 m). These
areas are either known before the mission starts or determined 
dynamically as a function of the changing context in the field. 
In our framework, we consider that working areas do not 
overlap, but robots from neighboring regions can communicate 
with each other if they are in transmission range. 

Connecting Areas: The operation area could be very large 
and scouts in different working regions could form multiple 
disconnected network partitions. More powerful robots cover 
connecting areas, moving across or around the working areas, 
interacting with or controlling the scouts. These robots are 
given the role of archivist. The archivists can be large vehi- 
cles that move slowly enough to manage their way through 
potential obstacles (5-10 m/s). They collect and log data from
scouts in on-board resilient storage until the end of the mission 
and connect the network partitions in a delay tolerant fashion, 
providing support for data muling [44]. In an urban scenario
with a grid road layout, the connecting areas could be avenues 
and streets. In general, they can be long stripes covering the 
whole length or width of the operation area. Similar to the 
working regions, they can be known a priori or assigned 
dynamically as a function of the position of the working 
regions and the paths taken by the human crews. 

Monitoring Areas: UAVs flying at fairly low altitudes and 
at speeds up to 20-30 m/s can offer high-level monitoring
of the operation area and alert the human crews in case of an 
emergency. Several monitoring regions can cover the entire 
operation area, but do not need to overlap. During their pass 
over the working regions, these UAVs can also log data from 
scouts on resilient storage until the mission ends, realizing the 
combined roles of supervisor and archivist. 

Robot Mobility: Our framework allows specifying different
mobility models for individual robots or groups of robots,
depending on the heterogeneity of the network. In choosing
the mobility pattern for robots in each area, we must consider
the mission type and the robot roles. For example, a supervisor 
could be instructed to move with a certain average speed 
between all working areas within its monitoring area, and 
slow down when flying over them. An archivist could move 
through its assigned area and slow down when communication 
with scouts is necessary. Finally, for scouts in a small working 
area, the random waypoint mobility model may suffice. More
complex group mobility or mobility driven by the mission's 
context can be considered as well. 
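
As an illustration of the simplest case mentioned above, the sketch below generates a random waypoint trace for one scout confined to a rectangular working area. It is a simplified version of the model (no pause time at waypoints), and the function name, parameters, and time step are assumptions made for this example rather than the interface of the mobility generation framework.

```python
import math
import random


def random_waypoint(width, height, speed_range, steps, dt=1.0, seed=None):
    """Return a list of (x, y) positions for one robot inside a width x height area.

    The robot repeatedly picks a uniformly random waypoint and a uniformly
    random speed from speed_range (m/s), then moves straight toward the
    waypoint in time steps of dt seconds. Pause times are omitted.
    """
    rng = random.Random(seed)
    x, y = rng.uniform(0, width), rng.uniform(0, height)
    tx, ty = x, y              # current waypoint (initially the start position)
    speed = 0.0
    trace = [(x, y)]
    for _ in range(steps):
        dist = math.hypot(tx - x, ty - y)
        if dist < 1e-9:
            # Waypoint reached: choose the next waypoint and travel speed.
            tx, ty = rng.uniform(0, width), rng.uniform(0, height)
            speed = rng.uniform(*speed_range)
            dist = math.hypot(tx - x, ty - y)
        step = min(speed * dt, dist)   # never overshoot the waypoint
        if dist > 0:
            x += (tx - x) / dist * step
            y += (ty - y) / dist * step
        trace.append((x, y))
    return trace


# A scout moving at 1-5 m/s in a 100 m x 100 m working area.
trace = random_waypoint(100.0, 100.0, (1.0, 5.0), steps=200, seed=7)
```

Because every waypoint lies inside the rectangle and the robot moves along straight segments, the trace never leaves the assigned area, which matches the assumption that robots stay in their areas.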

4 Modeling Failures in Robot Networks

Node failures in robotic networks or MANETs are usually 
considered independent (if failures are considered at all). Only
a few studies [27] have considered clustered node failures that
affect multiple nodes at the same time. In addition to the 
commonly examined scenario of isolated technical problems, 
such as motor issues (e.g., due to small rocks, ponds, vege- 
tation or sandpits) or running out of battery, in this study we 
consider two clustered node failure scenarios that we model 
next. First, isolated explosions or building collapses can affect 
multiple robots working in the same area. Second, unstable 
environmental conditions such as a big gas explosion or a 
bridge collapsing can cause failures to multiple such areas 
and their assigned robots. Furthermore, for each of these
failure scenarios there is an associated probability that a robot's
neighborhood fails. In Section 4.2 we present analytical
models that enable a robot to estimate this probability and
adjust its data replication accordingly (used in Section 5).

4.1 Failure Models 

We propose a novel failure generation framework that implements
three types of failures: independent robot failures,
independent area failures, and clustered area failures. In this 
framework we consider permanent failures of robots: when a 
failure happens, the robot stops moving and communicating 
and the on-board collected data are considered irretrievable. 

Model 1: Independent robot failures. This model covers 
the most frequently occurring scenario where robots fail in- 
dependently of each other. In this model, each robot type has 
an associated failure rate. Usually, scouts experience higher 
failure rates because of their assigned tasks (working areas), 
while the archivists and supervisors are the most reliable due to 
their powerful resources and less cluttered paths. We assume
these failures are uniformly distributed over the operation
area and the duration of the mission, as a function of a pre- 
established robot failure rate based on the particular type of 
robot and task assigned. 

Model 2: Independent area failures. This model covers 
the real-life scenarios where all robots in a working area 
fail together (e.g., due to a building collapse). We assume
these failures are uniformly distributed over the working areas
and duration of the mission, using a pre-defined area failure 
rate. Thus, during the operation, working areas are randomly 
selected to fail and all robots assigned to such areas fail 
simultaneously. 
Fig. 3. Example of generated mobility scenario. The boxes represent different types of areas where robots are
assigned and allowed to move. The robots stay in their areas, but interact with their group-mates and other areas' 
robots. Small squares with continuous lines are working areas, stripes (vertical or horizontal) with dashed lines are 
connecting areas, and dotted lines are monitoring areas. Black circles represent scouts, grey squares represent 
archivists, and grey triangles represent supervisors. 



Model 3: Clustered area failures. This model covers the 
case where an event affects multiple neighboring working 
areas, failing all the robots they contain: a big explosion, a
landslide, or a broken levee could lead to subsequent or
simultaneous failures in neighboring areas. We use a pre- 
defined failure rate to define the total number of failing 
neighboring working areas to account for the scale of the 
event. Thus, during the operation, a random working area A 
is selected to fail, and all working areas neighboring A also 
fail simultaneously. The number of neighboring areas to fail 
depends on the pre-defined failure rate. All robots assigned 
to these clustered areas fail simultaneously. The difference 
between independent area failures and clustered area failures 
of the same failure rate is in the location of failing robots: 
in scattered, randomly selected areas in the former case, or 
all concentrated in neighboring areas in the latter case. The 
two models have different outcomes in terms of network 
connectivity/partitioning and, as shown by our experimental 
work, in the performance of data replication algorithms. 
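
The difference between the three models can be made concrete with a short sketch that selects which robots fail. This is an illustrative interpretation, not the framework's implementation: it treats a failure rate as the fraction of robots or working areas that fail over the mission, ignores the timing of failures, and grows a cluster outward from the randomly selected area A in breadth-first order. The function names and data layout are assumptions.

```python
import random
from collections import deque


def independent_robot_failures(robots, rate_by_type, rng):
    """Model 1: each robot fails on its own, with the failure rate of its type."""
    return {rid for rid, (rtype, _area) in robots.items()
            if rng.random() < rate_by_type[rtype]}


def independent_area_failures(robots, areas, area_rate, rng):
    """Model 2: randomly scattered working areas fail with all their robots."""
    k = round(area_rate * len(areas))
    failed_areas = set(rng.sample(sorted(areas), k))
    return {rid for rid, (_rtype, area) in robots.items() if area in failed_areas}


def clustered_area_failures(robots, adjacency, area_rate, rng):
    """Model 3: a random area A fails together with areas neighboring it."""
    k = round(area_rate * len(adjacency))
    failed_areas = set()
    if k > 0:
        start = rng.choice(sorted(adjacency))
        failed_areas.add(start)
        queue = deque([start])
        while queue and len(failed_areas) < k:
            for nxt in adjacency[queue.popleft()]:
                if nxt not in failed_areas and len(failed_areas) < k:
                    failed_areas.add(nxt)
                    queue.append(nxt)
    return {rid for rid, (_rtype, area) in robots.items() if area in failed_areas}


# Hypothetical layout: four working areas in a row, two scouts per area.
# robots maps robot id -> (robot type, assigned area).
robots = {f"s{a}{n}": ("scout", a) for a in range(4) for n in range(2)}
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
```

With the same area failure rate, the second and third functions fail the same number of robots; only the location of the failed areas differs (scattered versus contiguous), which is the distinction drawn above.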

The above models assume that during a particular mission, 
one of the three types of failures dominates the operation. 
The human crews can assess which model to use given 
the environmental conditions and the type of disaster under 
investigation. Also, they can over-estimate the expected failure 
rate and reduce it to lower levels when they have updated 
information on the mission's status. 

We can also combine the above failure types to produce 
a more complex model based on a mixture of failure rates 
for each of the models, where robots can fail independently 
of each other (as in the first model), in small independent 
groups (as in the second model) or in large clustered groups 
(as in the third model). In this work we examine individually 
each of the three models; in the future, we plan to explore 
their combination into a more complex model. Next, given a 
particular failure model, we define analytically the combined 
probability of failure of a neighborhood of robots, as a function
of the different type and number of robots or areas comprising
this neighborhood and their expected failure rates. 

4.2 Failure Probability 

Data collected by the on-board sensors of a robot such as 
a scout can be lost due to robot failures. Data replication 
to nearby robots, especially to more reliable types of robots 
such as archivists or supervisors, can increase the survivability 
of these data. During a mission, if a robot can assess the
probability that its neighboring robots fail, it can adjust
how aggressively to replicate data. To this end, the Failure
Probability (FP) of the robot's neighborhood can be estimated
given the expected failure model, and the failure rate and
number of neighboring robots (first model) or failure rate and
number of neighboring areas (second and third models). FP,
defined next for each failure model, implies that fewer robots
or working areas in the neighborhood increase the probability
of data loss, but it also depends on the robot or area types.
For example, if the neighboring robots are only scouts, the 
failure probability will be higher than if there is a neighboring 
archivist as well. 

During a mission under failure model $m \in M = \{1, 2, 3\}$, we
assume the following for the neighborhood of a robot:

• There are robots of different types $i \in RT = \{1, 2, 3, \ldots\}$
(e.g., $i = 1$ for scouts, $i = 2$ for archivists, $i = 3$ for supervisors, etc.).

• There are $a_i$ robots of type $i \in RT$.

• Each robot type $i$ has an anticipated failure rate $p_i$.

• There are neighboring robot areas of different types $j \in AT = \{1, 2, 3, \ldots\}$
(e.g., $j = 1$ for working areas, $j = 2$ for
connecting areas, $j = 3$ for monitoring areas, etc.).

• There are $b_j$ areas of type $j \in AT$.

• Each robot area of type $j$ has an anticipated failure rate $\lambda_j$.

Then, we define the Failure Probability $FP_m$ for each failure
model $m \in M$ as follows:
For independent robot failures ($m = 1$) and robot types $i \in RT$:

For independent area failures ($m = 2$) and robot area types $j \in AT$:

For clustered area failures ($m = 3$) and robot area types $j \in AT$:

As the three proposed models show, the calculation of FP
takes into account multiple types of robots and robot areas. 
To demonstrate how the failure probability changes across 
different failure models and rates, we consider the following 
example, where we limit the types of robots to scouts (SC) 
and archivists (AR) and the types of areas to working (WA) 
and connecting areas (CA). In this example, we assume that 
a robot calculating the FP of its neighborhood encounters the 
following four typical scenarios of combinations of robots and 
areas: 

1) Two scouts assigned to the same working area (2SC-1WA)

2) Two scouts assigned to two different working areas (2SC-2WA)

3) Two scouts assigned to the same working area, and one
archivist assigned to a connecting area (2SC-1AR-1WA-1CA)

4) Two scouts assigned to two different working areas, and
one archivist assigned to a connecting area (2SC-1AR-2WA-1CA)

Figure 4 illustrates the failure probability for each
scenario, under the three failure models and three failure
rates. We set the failure rate of the archivists (and
of the connecting areas) to 1/4 of the failure
rate of the scouts (and of the working areas, respectively),
since we assume they exhibit higher reliability due to better
hardware and safer assigned tasks. The results confirm our
intuition that the lowest FP for all scenarios and failure rates
occurs under independent robot failures. Furthermore, the
combination of robot types and areas assigned greatly affects
the estimated failure probability. For example, the presence
of a single archivist in the neighborhood of the transmitting
robot can lower the failure probability for all failure types
and rates, in comparison to just having neighboring scouts.
Also, having neighboring scouts assigned to different working
areas decreases the FP when comparing independent robot and
area failures. The highest FP is expected under clustered area
failures, where multiple robots from neighboring areas can fail
simultaneously. In fact, in some scenarios there is a multifold
increase of FP between independent robot and area failures
(e.g., 2SC-1WA and 2SC-1AR-1WA-1CA under a failure rate of
0.1), or between independent area and clustered failures (e.g.,
2SC-2WA and 2SC-1AR-2WA-1CA under a failure rate of 0.1).
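
As a rough illustration of how a robot might estimate FP under the first model, the sketch below assumes that FP is the probability that all neighboring robots fail and that robots fail independently. Under those two assumptions the estimate is the product of $p_i^{a_i}$ over the robot types in the neighborhood. This specific form is an assumption made for illustration and may differ from the definition used in the evaluation; the function name is hypothetical.

```python
import math


def fp_all_neighbors_fail(counts, rates):
    """Probability that every neighboring robot fails, assuming independence.

    counts: robot type -> number of neighboring robots of that type (a_i)
    rates:  robot type -> anticipated failure probability of that type (p_i)
    """
    return math.prod(rates[rtype] ** n for rtype, n in counts.items())


# Two neighboring scouts, each with an assumed failure probability of 0.1.
only_scouts = fp_all_neighbors_fail({"scout": 2}, {"scout": 0.1})

# Same neighborhood plus one archivist at 1/4 of the scout failure probability.
with_archivist = fp_all_neighbors_fail(
    {"scout": 2, "archivist": 1}, {"scout": 0.1, "archivist": 0.025}
)
```

Under this assumed form, adding a reliable archivist to the neighborhood can only lower the estimate, in line with the qualitative behavior described above.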



5 Replication for Data Survivability 

Data replication can increase the level of data survivability
across the robot network in the presence of robot failures. In
such disaster missions, the data collected may have different 
survivability requirements, depending on their importance to 
the human crews. For example, infrared video footage inside 
a collapsed mine can be used to identify locations where 
people are trapped, and thus could require high survivability. Also,
radiation level readings inside a nuclear reactor after a power 
plant accident [23] can help the emergency crews identify 
dangerous locations, or when it is safe to enter the facility and 
perform repairs. However, humidity level readings might not 
be as important to survive the mission, and therefore could require
low survivability. Thus, adapting replication to the type of data 
can make better use of network resources while better serving 
the requirements for data survivability. 

This section discusses techniques for data replication for 
increased survivability with respect to cost of communication 
overhead, storage space, and power consumption. Understand- 
ing the tradeoff between the degree of data replication and the 
network costs is the objective of our work. To this end, we start 
our investigation with two intuitive techniques that do not take 
into account robot or data type in the data replication decision: 
(1) flooding, which optimizes uniform data survivability for
all data at high network costs; and (2) broadcasting, which
minimizes network costs with an intuitive penalty in data
survivability of all types of data. While these techniques
are well understood and have been thoroughly investigated 
in the past, our experimental evaluations contribute new un- 
derstandings for environments specific to robot teams. We 
continue with two other techniques that combine broadcasting 
and flooding but also take into account robot heterogeneity 
based on type of robot and task assigned. In particular, these 
techniques distinguish between scouts, mainly responsible for
recording data from the area of operation but prone to failures,
and archivists, reliable nodes that ensure data delivery at the
end of the mission. Finally, we propose an adaptive technique
in Section 5.3, which takes into consideration not only the
different roles (i.e., types) and anticipated failures of the robots,
but also the different types of data collected and their different 
survivability requirements. 

5.1 Basic Data Replication Techniques: Flooding 
and Broadcasting 

Flooding (Fl): A straightforward replication technique is
flooding. The source robot transmits a message or file with
a broadcast to all neighboring robots. The robots within
transmission range that receive this message, independently
of each other, decrement the time-to-live (TTL, the number of
remaining hops in the robot network) and retransmit if TTL > 0. This
process is repeated until TTL becomes zero in all copies of
the message. Flooding with large TTL achieves high data
survivability in mobile ad-hoc networks due to higher connectivity.
However, a robot network is usually sparse, with
varying connectivity that depends on the mobility model, weather
conditions, terrain and other obstacles. Thus, flooding even
with a high TTL can introduce redundant retransmissions and
unnecessary usage of the robot wireless network interfaces,
and thus increased power consumption, without translating to
increased data survivability, since the message may fail to
reach nodes away from a failing area. Our experimental results
in Section 6 provide more information on this topic.

Fig. 4. Failure probability with respect to failure rate of scouts (SC) and archivists (AR) when assigned to working (WA)
and connecting areas (CA), under the three failure models and for four hypothetical scenarios. For the independent
robot failure model, we assume that scouts have 4 times the failure rate of the archivists (i.e., $p_1 = 4p_2$). For the
independent area and clustered area failure models, we assume that working areas have 4 times the failure rate of
the connecting areas (i.e., $\lambda_1 = 4\lambda_2$).

Broadcasting (Br): Another simple data replication technique
is to just broadcast the data as soon as they are created.
All robots in transmission range receive the broadcast data and
store them locally. The replication hence stops at this first hop,
and these data are never replicated again. This is a special
case of the flooding technique, with TTL = 1. This approach
reduces network overhead and congestion in comparison to
flooding, but it is likely to replicate less and thus lead to lower
data survivability with respect to robot failures.

5.2 Combined Flooding-Broadcasting Proactive 
Techniques for Data Replication 

To exploit the heterogeneity of robots deployed in a mission,
we consider the case where archivists and supervisors
regularly broadcast HELLO messages while moving in their
designated areas. These messages alert the scouts in proximity
to send cumulatively all data they produced since the last
received HELLO message. This action increases the probability
that data survive a failure of the producer and of its replicas
on the scouts nearby. The cumulative transmissions, however,
subject the network to congestion: the data bursts transmitted
by all scouts around an archivist in response to its HELLO
message lead to the broadcast storm problem [42].

Broadcasting and Cumulative Limited Flooding (BrCLFl):
In this approach, the scouts broadcast their data
regularly. When they receive a HELLO message from an
archivist, they cumulatively flood all the data they produced since
the last received HELLO message, with a limited TTL (much smaller
than in full flooding). This technique tries to ensure that the
data will reach the archivist within a few hops, even if the
archivist only passed along the border of the scout's working area.
An increased network overhead is expected, yet lower than that
of the simple flooding approach with high TTL. However,
simultaneous and cumulative data transmissions can lead to
congestion at the archivist's wireless network interface and
might limit data survivability.

Broadcasting and Cumulative Broadcasting (BrCBr): 
Instead of flooding, the scouts cumulatively rebroadcast all 
their data since the last received HELLO message. This 
technique greatly reduces the network overhead, while trying 
to reach the archivist and increase the data survivability subject 
to network congestion and archivist reachability within the first 
network hop from the scouts. 
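The difference between BrCLFl and BrCBr comes down to the TTL attached to the cumulative batch sent in response to a HELLO. The sketch below illustrates this under stated assumptions: the class name `Scout`, its methods, and the TTL value 3 are hypothetical (the text only says the TTL is limited and much smaller than in full flooding), and transmissions are represented as `(item, ttl)` pairs rather than actual radio sends.

```python
class Scout:
    """Hypothetical scout that answers archivist HELLOs cumulatively."""

    def __init__(self, cumulative_ttl):
        # cumulative_ttl = 1 models BrCBr (rebroadcast);
        # a small value > 1 models BrCLFl (limited flooding).
        self.cumulative_ttl = cumulative_ttl
        self.produced = []          # items in production order
        self.last_hello_index = 0   # first item not yet sent cumulatively

    def produce(self, item):
        """Record a new item and broadcast it once (TTL = 1)."""
        self.produced.append(item)
        return [(item, 1)]

    def on_hello(self):
        """Send everything produced since the last received HELLO."""
        batch = self.produced[self.last_hello_index:]
        self.last_hello_index = len(self.produced)
        return [(item, self.cumulative_ttl) for item in batch]


flooding_scout = Scout(cumulative_ttl=3)   # BrCLFl-like, TTL value assumed
flooding_scout.produce("d1")
flooding_scout.produce("d2")
first_batch = flooding_scout.on_hello()
flooding_scout.produce("d3")
second_batch = flooding_scout.on_hello()
```

Because every scout near an archivist runs `on_hello` at about the same time, the batches collide at the archivist's interface, which is the congestion risk both techniques share.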

5.3 Failure-Adaptive and Delay-Tolerant Data Replication

The above techniques explore the trade-off between data 
survivability and network overhead imposed by replication 
without fully exploiting the particularities of the robot envi- 
ronment: 

1) Data have various degrees of importance and can be 
tagged accordingly with a desired level of survivability; 

2) Robot failures can be anticipated based on the type 
of robots and knowledge about the operation area. If 
failures are common, data need more replication. 

We propose a scalable technique that allows each scout to
adapt the replication level by taking into account the
anticipated failure rate of its 1-hop network neighborhood in
conjunction with the survivability requirement of the data type
to be replicated and the replication level it acquired from
previous transmissions. We assume that before the mission,
the human coordinators assess the situation, decide what
type of failure model is expected, and give an estimate of
the failure rate. This failure rate can be set to a high level
at the beginning of the operation, reflecting a pessimistic
assessment of the situation, and gradually lowered
as the coordinators have more empirical evidence at
hand. They can also define a priority across different types of
data, which can be used to assess the survivability requirement
for each data type. For example, in an earthquake scenario
such as in Haiti [46] or Japan [47], infrared video can be
assigned a survivability requirement of 100% and humidity or
temperature readings a survivability requirement of 50%. All
robots regularly transmit HELLO messages that include their
assigned area, expected failure rate, and data repository index.

Algorithm 1 shows the high-level pseudocode for the
adaptive replication technique. When a robot is about to
send a data item (line 5), the Failure Probability FP of
the 1-hop network neighborhood is assessed (line 7) using
the appropriate equation for the assumed failure model (i.e.,
eq. 1, eq. 2 or eq. 3). This is possible due to the HELLO
messages exchanged every few seconds between the robots.
FP is an estimate of the survivability to be lost by the data
item when broadcast at that time in the 1-hop neighborhood
(as explained in Section 4). Thus, 1 − FP is the survivability
acquired from the neighborhood. The remainder of the
survivability for the item, from its originally set value, is calculated
in line 8. If the number of neighbor robots that do not have
the item (line 9, as assessed by the HELLO messages) and
the survivability remainder (line 10) are positive, the item's
updated survivability requirement from the neighboring robots
is defined as in line 11. Otherwise, it is set to zero because
no more replication will be needed (line 14). In either case,
this updated value is attached to the item, and the item enters
the queue for immediate transmissions (line 16) from the
particular robot. If there are no robots in the neighborhood
or all of them already store the specific item (line 18),
FP was calculated taking into account only
the local robot failure rate in line 7. In this case, the item
enters the queue for delayed transmissions (lines 19-20). The
delayed transmissions queue Q is examined every time the
robot receives a new message such as a HELLO (line 38)
or a data item from other scouts (line 46). If Q is not empty,
each delayed data item is examined for any remainder of
survivability requirement (line 31) and the ScoutSendData()
function is called (line 32) to handle it accordingly. A scout
receiving an item (line 40) stores it locally (line 42) and, if
there is a survivability remainder, calls the ScoutSendData()
function (line 44). Otherwise, if an archivist receives an item
(line 48), it just stores it locally (line 50).
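The control flow described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the simulated implementation: survivability values are expressed as fractions in [0, 1], the failure probability and the number of neighbors lacking an item are supplied by hypothetical callables (standing in for eq. 1 to eq. 3 and for the HELLO bookkeeping), and the radio is replaced by a `transmitted` list.

```python
from collections import deque


class AdaptiveScout:
    """Sketch of the scout-side logic; all names here are hypothetical."""

    def __init__(self, failure_probability, neighbors_without):
        self.failure_probability = failure_probability  # () -> FP in [0, 1]
        self.neighbors_without = neighbors_without      # (data_id) -> int
        self.immediate = deque()   # queue B: immediate transmissions
        self.delayed = deque()     # queue Q: delayed transmissions
        self.storage = {}          # local data storage S
        self.transmitted = []      # stand-in for the wireless interface

    def send_data(self, data_id, surv_req):
        fp = self.failure_probability()
        diff = surv_req - (1.0 - fp)        # survivability still missing
        lacking = self.neighbors_without(data_id)
        if lacking > 0:
            # Split the remainder across neighbors, or mark as satisfied.
            updated = diff / lacking if diff > 0 else 0.0
            self.immediate.append((data_id, updated))
        elif diff > 0:
            # Nobody useful in range: keep the item for a later contact.
            self.delayed.append((data_id, surv_req))
        while self.immediate:
            self.transmitted.append(self.immediate.popleft())

    def check_queue(self):
        # Visit each delayed item once; unsent items are re-queued.
        for _ in range(len(self.delayed)):
            data_id, surv_req = self.delayed.popleft()
            if surv_req > 0:
                self.send_data(data_id, surv_req)

    def receive_hello(self):
        self.check_queue()

    def receive_data(self, data_id, surv_req):
        self.storage[data_id] = surv_req
        if surv_req > 0:
            self.send_data(data_id, surv_req)
        self.check_queue()
```

For example, an item with requirement 1.0 created while the scout is isolated and FP is 0.5 waits in the delayed queue; once a HELLO reveals two neighbors that lack it, it is transmitted with an updated requirement of 0.25.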

6 Evaluation 

In our experimental evaluations, we aim to understand: a) what 
improvements the proposed adaptive replication technique 
brings, in terms of better meeting the data survivability re- 
quirements and reducing communication costs in comparison 
to flooding and broadcasting techniques, under different failure 
models, failure rates and data types; and, b) how environmental 
parameters such as the size of operation affect the relative 
performance of the data replication techniques. 

6.1 Experimental Setup 

We used the Network Simulator NS-2 [35] for our evaluations,
with the Shadowing Propagation Model, a path loss
exponent of 3 (typical of urban environment simulations), and
a standard signal deviation of 6 (for obstructed communication
simulation). We considered a transmission range of 250m and
a conservative bandwidth of 11Mbps for the IEEE 802.11
protocol. In our experiments, we used two operation area
sizes: a 2Km x 2Km university campus (Small Area) and a
5Km x 5Km city region (Large Area). We limited the number
of types of robots to three: scouts, archivists and supervisors.
The archivists and supervisors are assigned the same
task with respect to collecting and storing data for higher
survivability, but in different types of areas. All robots
move according to a modified Random Waypoint Mobility
model within their assigned areas; we ensure that the archivists
always move forward until they switch direction upon reaching
the limits of their area. We ran each experiment for 1000
seconds of simulated time, enough for the archivists to cover
their assigned connecting areas. Our simulation results were
averaged over 5 randomly generated mobility scenarios.

     1  B = {};  // Data Queue for immediate transmissions
     2  Q = {};  // Data Queue for delayed transmissions
     3  S = {};  // Local Data Storage
     4  failure_type ∈ {1, 2, 3}
     5  ScoutSendData(data{dataID, surv_req})
     6  begin
     7      FP_failure_type ← failure_probability(failure_type)
     8      diff ← surv_req(dataID) − (1 − FP_failure_type)
     9      if neighborhood_size > 0 then
    10          if diff > 0 then
    11              surv_req(dataID) ← (diff / neighborhood_size);
    12          end
    13          else
    14              surv_req(dataID) ← 0
    15          end
    16          push data{dataID, surv_req} to B
    17      end
    18      else
    19          if diff > 0 then
    20              push data{dataID, surv_req} to Q
    21          end
    22      end
    23      if B ≠ empty then
    24          transmit pending data items
    25      end
    26  end
    27  ScoutCheckQ()
    28  begin
    29      for each data item in Q do
    30          pop data{dataID, surv_req} ← Q
    31          if surv_req(dataID) > 0 then
    32              ScoutSendData(data{dataID, surv_req})
    33          end
    34      end
    35  end
    36  ScoutReceiveHello()
    37  begin
    38      ScoutCheckQ()
    39  end
    40  ScoutReceiveData(data{dataID, surv_req})
    41  begin
    42      push data{dataID, surv_req} to S
    43      if surv_req(dataID) > 0 then
    44          ScoutSendData(data{dataID, surv_req})
    45      end
    46      ScoutCheckQ()
    47  end
    48  ArchivistReceiveData(data{dataID, surv_req})
    49  begin
    50      push data{dataID, surv_req} to S
    51  end

Algorithm 1: The pseudocode for failure-adaptive and delay-tolerant
replication

Scouts produce packets with three different survivability 
requirements at the same rate (a packet every 3-7 seconds), 
amounting to about 20,000 data items of each type during the 
simulation time. The only communication in these experiments 
is for data replication. In the adaptive technique, robots (scouts 
and archivists) include in their HELLO messages only the IDs 
of the last 10 data items received or sent. Even though this 
allows for a HELLO message of realistically small size of a 
few KBytes, it also limits the amount of repository history 
reported as logged on the robots. In order to compare the 
performance of this limited repository history (AdLH) with 
the ideal full repository history (AdFH) (i.e., when robots 
report all the repository of their locally logged data to reduce 
unnecessary duplicate transmissions), we ran simulations for 
the adaptive technique when the robots exchange a full history 
of the data items received or sent, while maintaining the 
same HELLO message size to keep the network overhead 
constant and comparable to the replication version with limited 
repository history. 
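As a rough consistency check of the item count quoted above, the following arithmetic combines the simulation length with the number of scouts implied by Table 1 (33 working areas with 3 scouts each). It assumes a mean creation period of 5 seconds (the midpoint of the 3-7 second range) and that no scout fails; both are assumptions made here for illustration.

```latex
N_{\text{items per type}} \approx
\underbrace{33 \times 3}_{\text{scouts}} \times
\frac{1000\ \text{s}}{5\ \text{s}}
= 99 \times 200 = 19{,}800 \approx 20{,}000
```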

We applied the three failure models described in Section 4
with failure rates of 0%, 10%, 20%, ..., 90%.
Furthermore, only scouts ($i = 1$) are
allowed to fail, whereas archivists (and supervisors) ($i = 2$)
are considered highly reliable and fault tolerant, thus
$p_1 \neq 0$ and $p_2 = 0$. Similarly, the area failures (independent or
clustered) are applied only on the working areas ($j = 1$) and
not on the connecting or monitoring areas ($j = 2$), thus
$\lambda_1 \neq 0$ and $\lambda_2 = 0$. Consequently, given the above
restrictions, we have the following typical cases for the calculation
of the failure probability. For the first failure type, when a robot has
only scouts in its neighborhood, $|RT| = 1$ and $FP_1 = (p_1)^{a_1}$
(from eq. 1). When the robot has scouts and at least one
archivist or supervisor in its neighborhood, $|RT| = 2$ and
$FP_1 = (p_1)^{a_1} (p_2)^{a_2} = 0$ (from eq. 1). For the second failure
type, when a robot has only scouts in its neighborhood,
$|AT| = 1$ and $FP_2 = (\lambda_1)^{b_1}$ (from eq. 2). When the robot has
scouts and at least one archivist or supervisor in its
neighborhood, $|AT| = 2$ and
$FP_2 = (\lambda_1)^{b_1} (\lambda_2)^{b_2} = 0$ (from eq. 2).
For the third failure type, when a robot has only scouts in its
neighborhood, $|AT| = 1$ and $FP_3 = \lambda_1$ (from eq. 3). When the
robot has scouts and at least one archivist or supervisor in its
neighborhood, $|AT| = 2$ and $FP_3 = (\lambda_1)(\lambda_2) = 0$ (from eq. 3).
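These special cases can be checked numerically. The sketch below assumes the product forms shown in the cases above, and reads the exponents as counts per type present in the neighborhood (robots of type i for the first model, areas of type j for the second); that reading is an assumption made here, since eq. 1 to eq. 3 are defined in Section 4. The function names and the rate 0.4 are hypothetical.

```python
from math import isclose, prod


def fp_product(rates, counts):
    """Product of rate ** count over types (form used for FP_1 and FP_2).

    A type with count 0 contributes a factor of 1, so it drops out.
    """
    return prod(rate ** count for rate, count in zip(rates, counts))


def fp_clustered(rates, counts):
    """Product of the rates of the types present (form used for FP_3)."""
    return prod(rate for rate, count in zip(rates, counts) if count > 0)


# Scouts (or working areas) fail with rate 0.4; archivists (or
# connecting areas) never fail, matching p_2 = 0 and lambda_2 = 0.
rates = [0.4, 0.0]
only_scouts = fp_product(rates, [3, 0])       # three scouts in range
with_archivist = fp_product(rates, [3, 1])    # plus one archivist
clustered_only_scouts = fp_clustered(rates, [3, 0])
clustered_with_archivist = fp_clustered(rates, [3, 1])
```

As soon as one never-failing archivist or supervisor is in range, the failure probability drops to zero, so the sender treats the item's survivability as fully acquired from that neighborhood.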

We allowed an initial warm-up period of 100 seconds before
failures start. Independent robot failures and independent area
failures were uniformly distributed over time and the operation
area. However, the clustered area failures occurred simultaneously
in the last 30% of the simulation time, mirroring the
real-case scenario of a large localized area failure (explosion,
bridge collapse, etc.). The ranges of values for the parameters
used in simulations are shown in Table 1. We simulated the 5
replication techniques presented in Section 5 with 10 different
failure rates over 3 different failure types under 5 different
mobility scenarios and 2 operation area sizes, for a total of
5 x 10 x 3 x 5 x 2 = 1500 simulations.

6.2 Performance Metrics 

To capture the total performance of the system at the end 
of the operation, we measure two cumulative performance 
metrics: 1) the Cumulative Deviation CD which evaluates the 
accuracy in meeting the data survivability objectives of the op- 
eration over all data types, and 2) the Cumulative Replication 
Factor CRF that evaluates replication redundancy and subse- 
quently communication overhead and redundant transmissions 
over all data types. 

Each data item of type $s$ has a Survivability Requirement
$SR_s$, where $s \in S = \{1, 2, 3\}$ and $SR_s = \{100\%, 75\%, 50\%\}$.
During an experiment with a failure type $m \in M = \{1, 2, 3\}$
and failure rate $p \in P = \{0\%, 10\%, 20\%, \ldots, 90\%\}$, data items
of type $s$ acquire an average survivability score $SA_s$, which
can be lower or higher than the respective $SR_s$. We define this
average score $SA_s$ of data items of type $s$ as follows:

$$SA_s = \frac{\text{Distinct items of type } s \text{ found on surviving robots at the end of the operation}}{\text{Distinct items of type } s \text{ produced by robots during the operation}}$$

Then, the Cumulative Deviation (CD) for a given failure
type $m$ and rate $p$, over all data types $s$, is defined as follows:

$$CD = \sum_{s \in S} |SA_s - SR_s|, \quad \forall m \in M, \ \forall p \in P \qquad (4)$$

The ideal CD is 0, i.e., the replication technique matches
exactly the data survivability needed by all data types with
the data survivability they acquired over the operation. Given that
we apply three types of data with $SR_s = \{100\%, 75\%, 50\%\}$,
by definition, the starting CD for failure rate $p = 0\%$, for all
failure types and replication techniques, will be CD = 75.
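The starting value follows directly from eq. (4), assuming that with no failures every produced item survives at least on its producer, so that $SA_s = 100\%$ for every type:

```latex
CD\big|_{p = 0\%}
= |100 - 100| + |100 - 75| + |100 - 50|
= 0 + 25 + 50
= 75
```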

Each data item of type $s$ can be replicated multiple times
on various robots. We define the replication factor $RF_s$ of data
items of type $s$ as follows:

$$RF_s = \frac{\text{Number of items of type } s \text{ found on surviving robots at the end of operation}}{\text{Distinct items produced by robots during the operation}}$$

Then, the Cumulative Replication Factor (CRF) for a
given failure type $m$ and rate $p$, over all data types $s$, is defined
as follows:

$$CRF = \sum_{s \in S} RF_s, \quad \forall m \in M, \ \forall p \in P \qquad (5)$$
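Both metrics can be computed from the contents of the surviving robots at the end of a run. The following sketch is illustrative only: the function name and input layout are hypothetical, survivability is expressed in percent, and the denominator of each replication factor is read as the distinct items of the same type, by analogy with the survivability score (an assumption made here).

```python
def survivability_metrics(produced, surviving_storages, requirements):
    """Return (CD, CRF) for one simulation run.

    produced: dict mapping item id -> data type s
    surviving_storages: iterable of sets of item ids, one per surviving robot
    requirements: dict mapping data type s -> SR_s in percent
    Assumes every data type has at least one produced item.
    """
    cd = 0.0
    crf = 0.0
    for data_type, required in requirements.items():
        items = {i for i, t in produced.items() if t == data_type}
        # Every stored copy of an item of this type, duplicates included.
        copies = [i for store in surviving_storages for i in store if i in items]
        acquired = 100.0 * len(set(copies)) / len(items)   # SA_s
        replication = len(copies) / len(items)             # RF_s
        cd += abs(acquired - required)
        crf += replication
    return cd, crf


produced = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 3, "f": 3}
requirements = {1: 100.0, 2: 75.0, 3: 50.0}
surviving = [{"a", "b", "c", "e"}, {"a", "c", "d"}]
cd, crf = survivability_metrics(produced, surviving, requirements)
```

In this toy run, type 1 and type 2 items all survive (type 2 overshoots its 75% requirement by 25 points), while half of the type 3 items survive, exactly meeting the 50% requirement.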
| Parameter | Values Used |
|---|---|
| Operation Area size | 5Km x 5Km, 2Km x 2Km |
| Number of Robots | 115 |
| Working Areas | Size: 100m x 100m; Number of areas: 33 (randomly placed in the operation area); Robots: 3 scouts per working area; Speed Range: 1-5 m/s |
| Connecting Areas | Size: 5Km x 50m, 2Km x 50m; Number of areas: 6 (3 vertical, 3 horizontal); Robots: 2 archivists per connecting area; Speed Range: 5-10 m/s |
| Monitoring Areas | Size: 2.5Km x 2.5Km, 1Km x 1Km; Number of areas: 4 (equally dividing the operation area); Robots: 1 supervisor per monitoring area; Speed Range: 10-15 m/s |
| Data item size | 500 Bytes per packet |
| Data item creation period | 3-7 seconds |
| HELLO message period | 8-12 seconds |
| Failure Rates | 0%, 10%, 20%, ..., 90% |
| Survivability Requirement | 50%, 75%, 100% |

TABLE 1

Simulation parameters and values used
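The total number of robots in Table 1 is consistent with the per-area allocations listed in the same table:

```latex
\underbrace{33 \times 3}_{\text{scouts}}
+ \underbrace{6 \times 2}_{\text{archivists}}
+ \underbrace{4 \times 1}_{\text{supervisors}}
= 99 + 12 + 4 = 115
```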



6.3 Results for Large Operation Area (5Km x 5Km): a 
City Region 

Figure 5 presents the results on cumulative deviation for the
five different replication techniques, under the three failure
models and different failure rates, for a large operation area
of 25 square kilometers. We notice that localized area failures
lead to a higher cumulative deviation compared to independent
robot failures, due to the large operation area and the
consequent sparsity of the robot network. The adaptive technique
(AdLH and AdFH) performs considerably better, adjusting the
replication efforts to match the survivability requirements: for
low and medium level failure rates, i.e., below 60-70%, the
cumulative deviation across all types of data is almost half
that of the other techniques, for all three types of
failures. The broadcasting techniques (Br and BrCBr) perform
fairly well in comparison to the flooding technique (BrCLFl).
We decided to exclude the simple flooding from the graphs,
as it performs similarly to BrCLFl but with higher network
overhead.

Figure 6 shows the cumulative replication factor for the five
different replication techniques, under the three failure models
and different failure rates, for the large operation area. We
notice that the adaptive technique creates the lowest number
of replicas and thus induces the lowest network overhead
in the system when the anticipated failures are at low or
medium levels, i.e., below 60%. For high failure rates,
i.e., above 60%, the adaptive technique performs similarly to
broadcasting (Br and BrCBr). Flooding (BrCLFl) performs
the worst: even though the technique first broadcasts and then
cumulatively floods on a small radius around the transmitting
scout (i.e., with a small TTL), the redundant replication
induced by the flooding part of the method increases its
cost dramatically, especially when the failure rates are at low
levels, i.e., below 30%. As shown in Figures 5 and 6, the
adaptive method with full history performs better than the
other techniques for an additional 10-20% of failures. This
demonstrates that the full history helps meet the survivability
requirements even better than the limited history (as shown by
the performance metric CD), while reducing communications
and redundant transmissions (as shown by the performance
metric CRF).

6.4 Results for Small Operation Area (2Km x 2Km): a 
University Campus 

In the small operation area setup, the monitoring and connecting
areas are smaller, while the working areas remain the
same size but are placed closer to each other, increasing network
density. All other parameters remain the same.

Figure 7 shows the cumulative deviation for the five different
replication techniques, under the three failure models and
different failure rates, for a small operation area of 4 square
kilometers. In contrast to the large operation area, we observe
similar performance patterns for all three types of failures
due to a significant increase in network density and thus
reachability between robots. The adaptive technique exhibits
half the cumulative deviation of the other techniques for low
and medium level failure rates, i.e., up to 60%. As expected, in
this smaller area of operation (about 6 times smaller than the
large area), flooding is not a viable option as it deviates the
most from the data survivability requirements. Nevertheless,
broadcasting performs well, especially at high failure rates.

Figure 8 shows the cumulative replication factor for the
five different replication techniques, under the three failure
models and different failure rates, for the small operation area.
We notice that the adaptive technique does not perform as
well as in the large area due to the overhearing effect: 1) The
transmitting robot calculates a failure probability for
its neighborhood and sends an item. 2) Robots in transmission
range that were not accounted for in the failure probability
calculation (e.g., because their HELLO message was dropped, or
because they came within reach afterwards) receive the item, store
it locally and contribute to its acquired survivability, but also to
its replication factor. 3) However, the transmitting robot calculated a
higher failure probability, thus the survivability needed from
the receiving robots is set to a higher level and more replicas
are potentially created to reach the survivability requirement.
Flooding performs the worst, inducing high communication
and redundant replication (more than an order of magnitude
higher than in the large area).

Fig. 5. Large area, Cumulative Deviation (CD) for five different replication techniques. Br: Broadcast, BrCBr: Broadcast
and Cumulative Broadcast, BrCLFl: Broadcast and Cumulative Limited Flooding, AdLH: Failure Adaptive and Delay
Tolerant (limited repository history), AdFH: Failure Adaptive and Delay Tolerant (full repository history).

Fig. 6. Large area, Cumulative Replication Factor (CRF) for five different replication techniques. Br: Broadcast, BrCBr:
Broadcast and Cumulative Broadcast, BrCLFl: Broadcast and Cumulative Limited Flooding, AdLH: Failure Adaptive
and Delay Tolerant (limited repository history), AdFH: Failure Adaptive and Delay Tolerant (full repository history).

Fig. 7. Small area, Cumulative Deviation (CD) for five different replication techniques. Br: Broadcast, BrCBr: Broadcast
and Cumulative Broadcast, BrCLFl: Broadcast and Cumulative Limited Flooding, AdLH: Failure Adaptive and Delay
Tolerant (limited repository history), AdFH: Failure Adaptive and Delay Tolerant (full repository history).

6.5 Summary of Experimental Results 

Our simulations of urban search and rescue scenarios within 
disaster environments show how the performance of various 
replication techniques depends on the operation area size and 
the type and rate of failures. Especially in a large operation 
area, data survivability requirements are more difficult to meet 
when area failures (independent or clustered) are considered, 
in comparison to the independent robot failures. Therefore, 
the replication technique used must be more adaptive and 
aggressive, to provide higher guarantees of meeting the data 
survivability requirements. On the other hand, a smaller op- 
eration area leads to similar performance patterns across all 
failure types, due to higher network density. Thus, in such an 
operational setup with high network density it is less critical to 
specify exactly the failure type anticipated during the mission. 

However, from the results in both operation areas, the 
failure rate is the decisive factor for the performance in data 
survivability, and can be discretized into three ranges: low 
(0% - 30%), medium (30% - 60%) and high (60% - 90%). 
Consequently, the human coordinators can choose among these 
three levels instead of an exact failure rate. Furthermore, the 
failure rate level can be assumed high in the beginning of 
the mission and gradually reduced to lower levels as the 
coordinators have more empirical evidence at hand. 
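The three-level discretization can be expressed as a small helper. The text gives overlapping range endpoints, so the sketch assumes that a shared boundary belongs to the lower level (30% is low, 60% is medium); this choice and the function name are assumptions made here for illustration.

```python
def failure_level(rate_percent):
    """Map an anticipated failure rate (0-90%) to low, medium or high.

    Boundary handling is an assumption: 30 counts as low, 60 as medium.
    """
    if not 0 <= rate_percent <= 90:
        raise ValueError("failure rate outside the simulated 0-90% range")
    if rate_percent <= 30:
        return "low"
    if rate_percent <= 60:
        return "medium"
    return "high"


# Coordinators may start pessimistic and relax the level over time.
schedule = [failure_level(rate) for rate in (80, 50, 20)]
```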

The amount of data that survives in the network depends on
how aggressively the technique replicates outside the
producer's area, compensating for the replicas that will be lost due
to local robot failures. The adaptive technique best matches
the survivability requirements of data for up to medium levels
of anticipated failure rates (up to 60%), while reducing the
number of replicas in the network (compared to flooding
and broadcasting techniques), thus reducing communication
overhead and battery and storage usage.

7 Related Work 

The major differences between our work and previous studies
lie along three dimensions: (i) heterogeneity in nodes, types
of failures, and data requirements; (ii) large-scale network-level
simulations; and (iii) data replication in the context of
a heterogeneous environment.

Studies that have considered node heterogeneity assume
that some nodes are more stable and more resourceful than
others. CLEAR [30] deploys a super-peer architecture that
exploits relatively stable peers having maximum remaining
battery power and processing capacity among their regional
neighbors to determine a near-optimal reallocation period
based on mobile host schedules. In [1], resourceful nodes serve
as cores to enable core-aided routing. Replication schemes
examined include "copy-to-core", where both regular nodes
and core nodes are carriers of messages to the destination, and
"dump-to-core", where the regular nodes delete the messages,
leaving the cores to deal with the delivery. Such nodes, due
to their extended resources, acquire a role similar to that of the
archivists and supervisors presented in this work. However,
unlike in our work, no differentiation was made with respect
to their mobility or the failures they exhibit, or how these affect
the replication effort.

Node failures in robotic networks or MANETs are usually
considered independent and homogeneous. A deviation
from the independent node failure model was studied in
Replic8 [27], which considered clustered node failures that
affect multiple nodes at the same time. Replic8 performs
location-aware file replication and tackles localized network
failures by storing replicas at faraway network locations, and
achieves high availability with fewer replicas than
replicating at random locations. Similarly, our work applies
failure models that affect multiple nodes at the same time.
However, we distinguish between independent area failures,
where robots of the same team working in an area can fail at the
same time independently of neighboring teams, and clustered
area failures, where multiple neighboring areas can fail at the
same time.

Heterogeneity in data has been used in prioritized epidemic
routing (PREP) [38], where bundles of data are prioritized
based on cost to destination, source, and expiration time.
Costs are derived from per-link "average availability"
information that is disseminated in an epidemic manner. PREP
maintains a gradient of replication density that decreases
with increasing distance from the destination. In our work, we
also assume that data are replicated based on different
requirements or priorities. However, these requirements reflect
how important it is for the data to survive the mission, as judged by
the human coordinators, and not their producer, their
destination or associated lifetime. Our adaptive replication
technique adjusts (i.e., increases) the survivability acquired by
the data as they are replicated across new robots, until their
accumulated survivability meets the original requirements set
by the coordinators.

Research in MANETs traditionally uses simulations at net- 
work level with a focus on scalability, but considers nodes 
homogeneous and failures independent; in robotics research 
the emphasis is more on studying the features of a partic- 
ular node or improving the collaboration between nodes in 
relatively small teams, and thus traditionally focuses less on 
scale or low-level communication aspects. Our study bridges 
this gap by leveraging MANET traditional tools (low-level 
network simulators, in this particular case NS-2) in a large- 
scale environment populated by heterogeneous mobile nodes 
with differentiated characteristics defined by robotics scenarios 
such as available robot resources and sensor data production, 
robot mobility and anticipated failures. 

Data replication has been studied extensively in MANETs.
Three replication techniques to improve data accessibility are
presented in [17], [18], where data access frequency is used
to place replicas, eliminate duplications among neighboring
hosts, and share replicas among stable groups of hosts. In [10],
nodes exchange availability information by broadcasting
advertisement messages to decrease redundancy and create a data
lookup table within their group. Group partitioning prediction
allows replication of data before partitioning occurs. In [25],
Jing et al. propose an algorithm that takes into account the
nodes' motion and tries to minimize the communication cost of
data access. The algorithm attempts to adapt dynamically the
replica allocation scheme to a local optimal scheme. A
theoretical approach that quantifies the effect of data replication on
availability in MANETs is presented in |Q3). In El) Zheng et
al. use read and write statistics to define the level of replication
in a neighborhood of nodes. In [28], the peers define data
cache policies based on requests to access particular data. In
contrast to these studies on data replication for availability
and accessibility in real time, we focus on the problem of
data survivability, which is data durability for the duration of
the operation, given different data survivability requirements
and node failures.

Employing lessons from [13] on UAVs, our adaptive method
uses opportunistic communication (similar to delay-tolerant
networking) between all types of robots (UGVs, USVs and
UAVs) to improve the survivability of data in the system in the
face of robot and network failures. Several studies (e.g., [1], [15],
[16], [19], [24], [38], [40]) focus on challenged networks and employ
delay-tolerant techniques for data replication. A disruption-resilient
content dissemination approach for MANETs [24]
exploits the in-network storage and hop-by-hop dissemination
of named information objects. A broadcast scheme for delay-tolerant
networks is presented in [16], where a node transmits
a message only when at least one node in range does not
have the particular message. Our adaptive technique applies a
similar approach: upon receiving a HELLO message from an
archivist, a scout reactively transmits the history of data since
the last HELLO from an archivist. With "Spray and Wait" [40],
a node "sprays" a number of copies into the network and then
"waits" until one of the sprayed nodes meets the destination.
In [15], epidemic routing is used as a controlled flooding
technique, where a pair of nodes exchange missing packets when
they come into contact. Given enough storage space, epidemic routing
can be used to reliably disseminate data across the network. An
evaluation of different controlled message flooding schemes
over disconnected sparse mobile networks is presented in [19].
Similarly to these efforts, our adaptive technique also uses
a delay-tolerant mechanism. However, our ultimate goal is
to increase the survivability of data with the least network
overhead induced in the system. Additionally, instead of
high-density mobile networks comprised of homogeneous nodes
producing one type of data, we consider scenarios with low,
non-uniform node density and heterogeneous nodes which
continuously produce data of different types and with different
survivability requirements.

Significant research has been done to provide realistic 
mobility models for MANETs. Of particular relevance for our 
work are group mobility models, such as the ones presented 
in 0, GO), OH and 05]. These can be used to model the 
robot mobility inside the areas defined in this paper, instead of 
the Random Waypoint Model used. Also, a model for realistic 
representation of the movement of civil protection units in a 
disaster area scenario was presented in [2]. This study focused 
on a small scale operation area such as a collapsed building 
(hundreds of square meters) with a high network density and 
high reachability between nodes. Instead, our study considers 
large operation areas (tens of square kilometers) that include 
several small working areas (e.g., collapsed buildings) and 
nodes forming a fairly sparse network. Moreover, node failures 
during the operation further reduce network density and 
challenge data survivability. 

8 Conclusions 

Robots typically have more resources than, for example, smart 
phones or sensors, but are more prone to fail. At the same 
time, the data they collect when deployed in urban disaster 
environments can be vital for search and rescue efforts. This 
paper studies the problem of data survivability in mobile 
robot networks by acknowledging that, unlike nodes whose 
mobility patterns cannot be externally controlled or accurately 
predicted (e.g., cell phones), more information can be available 
in robot networks and can be used for failure-resistant data 
collection in disaster scenarios. Our work bridges the fields 
of MANETs and Robotics by studying the data survivability 
using large-scale, network-level robotic simulations, while 
leveraging human assessment of the environment and robot 
heterogeneity. 

We proposed a delay-tolerant failure-adaptive replication 
technique that takes into account the following mission- 
specific and environmental parameters. First, data collected 
during a mission have different survivability requirements 
based on their importance to the human crews. This mission-specific 
parameter can be easily estimated as a function of the 
sensors available for collecting data and the deployment scenario. 
For example, in a nuclear power plant accident, video 
footage from the on-board robot cameras while inspecting 
the structural integrity of the site can be assigned a surviv- 
ability requirement of 100%, radioactivity level readings can 
be assigned a survivability requirement of 75% (by default, 
rescue crews enter the site with radiation-resistant suits), 
whereas temperature and humidity readings can be assigned a 
survivability requirement of 25%. 
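
To make the role of this parameter concrete, the sketch below encodes the
survivability requirements of the example above and derives a replica count
from them. The mapping keys, the function `replicas_needed`, the cap
`max_replicas`, and above all the assumption that every copy is lost
independently with the same probability are illustrative choices made here;
they are not the decision rule of our technique.

```python
import math

# Survivability requirements taken from the nuclear power plant example.
SURVIVABILITY = {
    "video": 1.00,
    "radioactivity": 0.75,
    "temperature_humidity": 0.25,
}


def replicas_needed(requirement, failure_prob, max_replicas=10):
    """Smallest k with 1 - failure_prob**k >= requirement, capped at max_replicas.

    Assumes each copy is lost independently with probability failure_prob,
    which is a simplification made for this sketch only.
    """
    if not 0.0 <= failure_prob < 1.0:
        raise ValueError("failure_prob must be in [0, 1)")
    if failure_prob == 0.0:
        return 1                      # a single copy always survives
    if requirement >= 1.0:
        return max_replicas           # 100% cannot be reached; use the cap
    k = math.ceil(math.log(1.0 - requirement) / math.log(failure_prob))
    return max(1, min(k, max_replicas))
```

Under these assumptions and a per-copy loss probability of 0.6, radioactivity
readings would need three copies and temperature and humidity readings one,
while video footage would be replicated up to the cap.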

Second, robots are assigned different tasks based on their 
hardware characteristics and mission. For example, robot 
scouts can work as teams to collect data and complete a 
common task. On the other hand, robot archivists can follow 
paths that connect scout teams, for better control, wireless con- 
nectivity and data backups. Depending on the robot type and 
assigned task, robots can exhibit different failure rates, which 
makes robot heterogeneity an important mission-specific pa- 
rameter. 

Third, robotic failures of various types and rates lead to data 
loss and greatly affect the success of a mission. We classified 
these failures into three categories: independent robot failures, 
independent area failures and clustered area failures. Human 
operators could estimate the type of robot failure expected 
during a mission based on environmental conditions and robot 
types used. For example, urban architecture information such 
as the placement of gas pipes or topology could lead to 
explosions or, respectively, flooding, and thus to clustered area 
failures, whereas a preliminary or conservative estimation of 
the state of a building or bridge structure could determine area 
failures. Finally, empirical estimation of robot reliability based 
on hardware specifications and task could suggest independent 
robot failures. 
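
The three failure categories can be contrasted with a short sketch. The
function names, the dictionary `area_of` (robot to area), the adjacency map
`neighbors`, and the spreading rule used for clustered failures are
assumptions made for illustration; our failure generation framework is not
reproduced here.

```python
import random


def independent_robot_failures(robots, rate, rng):
    """Each robot fails on its own with probability `rate`."""
    return {r for r in robots if rng.random() < rate}


def independent_area_failures(area_of, rate, rng):
    """Each area fails with probability `rate` and takes all of its robots."""
    areas = sorted(set(area_of.values()))
    failed_areas = {a for a in areas if rng.random() < rate}
    return {r for r, a in area_of.items() if a in failed_areas}


def clustered_area_failures(area_of, neighbors, seed_area, spread, rng):
    """A failure starts in `seed_area` and spreads to adjacent areas.

    For every failed area, each adjacent area that has not failed yet is
    given one chance to fail with probability `spread` (assumed rule).
    """
    failed_areas = {seed_area}
    frontier = [seed_area]
    while frontier:
        current = frontier.pop()
        for nxt in neighbors.get(current, ()):
            if nxt not in failed_areas and rng.random() < spread:
                failed_areas.add(nxt)
                frontier.append(nxt)
    return {r for r, a in area_of.items() if a in failed_areas}
```

The difference that matters for replication is visible in the return values:
with area failures, robots of the same area always fail together, so copies
kept only within one area offer little protection.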

The rate of failures is difficult to estimate accurately, 
especially for clustered and independent area failures. We 
acknowledge that in many cases the value of the anticipated 
failure rate may be just an educated guess. However, even in 
such cases, instead of the exact rate, an average level of failure 
rate could be used. Our results indicate three general levels 
of failure rate that can inform adaptive replication decisions: 
given a pessimistic estimation of the situation, the operation 
coordinators can set the failure rate to a high level, and later on 
reduce it to medium or low levels based on empirical evidence. 

These parameters are used by our replication technique to 
find the right tradeoff between replication (and thus resource 
consumption, such as battery, communication volume and stor- 
age) and data survivability. Our technique utilizes a distributed 
mechanism to estimate the probability that a neighborhood 
fails. Each robot uses this estimation to adapt the replication 
for each type of data based on the failures anticipated in its 
neighborhood, which is a strong indicator of the survivability 
that the data can acquire during the mission. As a future 
extension, this mechanism could also be used by each robot to 
update its failure rate in real time during the mission, instead 
of being set by the coordinators. Thus, a robot could start at a 
default failure rate and by updating this rate based on feedback 
from the observed failure probability of its neighborhood, 
it could gradually converge to a stable and more accurate 
failure rate. Initially, this would incur higher communication 
costs than necessary, yet still lower overall than those of the 
baseline solutions with which we compared our adaptive 
technique. 
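
One possible form of this future extension is sketched below. Exponential
smoothing, the weight `alpha`, and both function names are assumptions made
here for illustration; the feedback rule itself is left open in this paper.
Note also that in a mobile network a silent neighbor may simply have moved out
of range, so the observed fraction is only a rough proxy for failures.

```python
def observed_failure_fraction(expected_neighbors, heard_neighbors):
    """Fraction of previously known neighbors that no longer answer."""
    expected = set(expected_neighbors)
    if not expected:
        return 0.0
    missing = expected - set(heard_neighbors)
    return len(missing) / len(expected)


def update_failure_rate(current_rate, observed_fraction, alpha=0.2):
    """Move the local estimate toward the observed neighborhood value.

    alpha close to 0 trusts the current estimate, alpha close to 1 trusts
    the latest observation (assumed smoothing rule).
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    return (1.0 - alpha) * current_rate + alpha * observed_fraction
```

Starting from a pessimistic default and applying the update repeatedly with a
steady observation moves the estimate toward that observation, which matches
the gradual convergence described above.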

For our experimental study we proposed and used novel 
frameworks for generating realistic mobility and failure sce- 
narios for mobile robot networks when deployed for search 
and rescue missions in urban disaster scenarios. These frame- 
works were adapted for the Network Simulator NS-2 and 
allowed us to experiment with different sizes and types of 
areas, for different numbers and types of robots assigned 
in each area, with a variable mobility pattern for each robot 
type, and finally different failure models: independent robot 
failures, independent area failures and clustered area failures. 
Extensive network-level simulations demonstrated that our 
adaptive replication technique allows a mission to withstand 
failure rates of up to 60% of the robots for all three failure 
models examined, while resulting in better data survivability 
than flooding and broadcasting-based techniques and without 
inducing higher communication costs. 

Deployment of ad-hoc collaborative robot teams is currently 
done at a small scale and in controlled environments. The 
reliability of robots and survivability of their data in the 
presence of environmental hazards need more studies through 
realistic simulations to allow the emergence of large-scale self- 
coordinating mobile robot networks. Our study contributes to 
more realistic modeling of robot placement, mobility, failures 
and survivability of their data in urban disaster environments. 